In order to integrate a new feature of on/off target levels of coverage in NGSphy, I needed to understand coverage variation that one could find in a capture experiment. Goals are basically:
Data used comes from Rute’s Hummingbird project. Most of the information reported here about the samples and targets was extracted from the paper.
BAM files corresponding to the mappings of the targeted-sequencing to 2 references.
map2Anna).map2Swift).Workspace triploid (UVigo):
triploid.uvigo.es/home/merly/ research/cph-bam-coverageWorkspace randy (KU):
randy.popgen.dk/home/merly/anna/home/merly/swiftTargeted regions are exons, retrieved a posteriori, due to the fact that the original probes for this project were lost during a flood.
Ingroup reference: Original target files given was a GFF file: Calypte_anna.gene.CDS.2750.gff [1.6M]. This file was converted into a BED file, keeping only chromosome (scaffold), start and end position of the targets.
Outgroup reference:: For this one, I received 2 files. One that had the known one-to-one ortholog relation between the genes from Anna and Swift (48birds_ortholog.list.chi.anna.cpe.hum.finch [656K]). The second, is a GFF file, which contains the exons from Swift (Chaetura_pelagica.CDS.gff[8.5M]). I had to get the targeted regions from the GFF files, taking into account that I needed the genes matching to Anna, and also I needed those genes that were actually used in the Anna’s targets (Calypte_anna.gene.CDS.2750.gff):
48birds_ortholog.list.chi.anna.cpe.hum.finch.Chaetura_pelagica.CDS.gff where codon regions (CDS).Calypte_anna.gene.CDS.2750.gff so I’ll have the same covered genes.Chaetura_pelagica.CDS.gff and keep those regions matching both Anna and Swift.Finally, datasets will be filter to match gene number.
birds48filename=paste0(WD,"48birds_ortholog.list.chi.anna.cpe.hum.finch")
swiftgff=paste0(WD,"Chaetura_pelagica.CDS.gff")
annagff=paste0(WD,"Calypte_anna.gene.CDS.2750.gff")
birds=read.table(birds48filename, header=T,
colClasses=c("character","numeric","character","character",
"character","character","character"))
swift=read.table(swiftgff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
anna=read.table(annagff,
colClasses=c("character","character","character","numeric",
"numeric","character","character","character","character"))
birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(anna$V9[birds.5$anna %in% anna$V9])
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
swiftGenes=birds.6$swift
swift.2=swift[swift$V3=="CDS",]
swift.3=swift.2[swift.2$V9 %in% birds.6$swift,]
swift.3$size=swift.3$V5-swift.3$V4
swift.4=swift.3[swift.3$size>0,]
anna.2=anna[(anna$V5-anna$V4)>0,]
swift.4=swift.4[,1:(ncol(swift.4)-1)]
anna.3=anna.2[anna.2$V9 %in% unique(birds.6$anna),]
anna.3$targetname=paste0( anna.3$V1,
rep("-", nrow(anna.3)),
anna.3$V4,
rep("-", nrow(anna.3)),
anna.3$V5)
swift.4$targetname=paste0( swift.4$V1,
rep("-", nrow(swift.4)),
swift.4$V4,
rep("-", nrow(swift.4)),
swift.4$V5)
This is a quantitative summary description of the resulting target files. The difference between number of genes, and subsequently in number of targets and covered genome, in both Anna and Swift, is not precisely alarming. The experiment was done using known ortholog genes, which not necessarily match in number of exons (targets) or their size. There are two (2) Anna datasets, the original and the reduced (1,469 genes) and the Swift dataset.
| Number of targets | Size of targeted genome (bp) | Number of genes | Total of scaffolds | |
|---|---|---|---|---|
| Anna - Original | 24,044 | 3,934,867 | 2,750 | 389 |
| Anna - Reduced | 14,740 | 2,285,746 | 1,469 | 314 |
| Swift | 14,764 | 2,279,626 | 1,469 | 372 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna - Original | 104 | 724 | 1,136 | 1,431 | 1,800 | 13,150 |
| Anna - Reduced | 110 | 842 | 1,285 | 1,556 | 1,953 | 13,150 |
| Swift | 110 | 843 | 1,281 | 1,552 | 1,943 | 13,000 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Anna - Original | 1 | 90 | 125 | 163.7 | 170 | 5,246 |
| Anna - Reduced | 1 | 90 | 125 | 163.2 | 169 | 5,246 |
| Swift | 1 | 89 | 123 | 154.4 | 166 | 5,240 |
Target datasets used from now on will contain the same number of genes (1,469).
I obtained the coverage of the targeted regions using bedtools[9] (v. 2.22.0), module coverage. With this, and the option -hist I can report a histogram of coverage for each feature in A as well as a summary histogram for _all_ features in A. Output (tab delimited) after each feature in A (from bedtools documentation))
# bases at depthfor bamfile in $(find $HOME/anna/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "Calypte_anna.gene.CDS.2750.3.gff" | gzip > $HOME/anna/bedtools/cov/${tag}.cov.gz &
done
for bamfile in $(find $HOME/swift/originals -name "*.bam"); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "Chaetura_pelagica.CDS.2.gff" | gzip > $HOME/swift/bedtools/cov/${tag}.cov.gz &
done
And so, I filtered the output to keep the coverage per region and the summary histogram separated.
# (i.e)
for tag in $(cat $HOME/anna/files/samples.txt ); do
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist/$tag.nohist.gz &
nohup zcat $HOME/anna/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist/$tag.hist.gz &
done
for tag in $(cat $HOME/swift/files/samples.txt | tail -n+2 | head -46 ); do
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist/$tag.nohist.gz &
nohup zcat $HOME/swift/bedtools/cov/H09.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist/$tag.hist.gz &
done
With the information extracted from the bedtools coverage -hist, we can see the relation between the breadth of coverage obtained and the depth per sample.
| Map2 - target | Coverage per target overview |
|---|---|
| map2Anna | |
| map2Swift |
Zoom: 0x-100x.
| map2Anna | map2Swift |
|---|---|
From the bedtools coverage output I was able to extract the depth of coverage per target, gene and sample. I generated 2 matrices:
Each cell of the matrix correspond to the sum of the number of times each base was covered within the target or the gene. Then, coverage was calculated as the mean value of all the depth of coverage of all the samples for a specific target/gene divided by the size of the corresponding target/gene.
For the depth of coverage for the samples, I summed the coverage of all targets and divided it by the total amount of bases that were targeted.
Note: Outliers presented below were calculated using the following:
qnt <- quantile(data, probs=c(.25, .75))
H <- 1.5 * IQR(data)
y <- data
y[data < (qnt[1] - H)] <- "Outlier"
y[data > (qnt[2] + H)] <- "Outlier"
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 2.115 | 47.76 | 77.73 | 101.3 | 115.8 | 310.2 |
| map2Swift | 1.715 | 38.27 | 62.29 | 83.8 | 98.07 | 265.6 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 12.05 | 72.38 | 98.38 | 105 | 133.5 | 290.3 |
| map2Swift | 0.8245 | 54.24 | 81.06 | 85.55 | 111.5 | 702.6 |
Identification and removal of the outliers
Number of outliers genes and coverage distribution
| Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
|---|---|---|---|---|---|---|---|
| map2Anna | 21 | 225.2 | 237.3 | 251.2 | 255.5 | 272.8 | 290.3 |
| map2Swift | 19 | 199.1 | 207.3 | 215.7 | 246.4 | 234.3 | 702.6 |
| map2Anna Gene | map2Anna Size | map2Swift Gene | map2Size Gene |
|---|---|---|---|
| Parent=Aan_R000557; | 1025 | Parent=Cpe_R000257; | 1008 |
| Parent=Aan_R000645; | 1400 | Parent=Cpe_R001479; | 794 |
| Parent=Aan_R000757; | 1675 | Parent=Cpe_R001695; | 145 |
| Parent=Aan_R000784; | 1089 | Parent=Cpe_R002655; | 431 |
| Parent=Aan_R000837; | 2118 | Parent=Cpe_R003268; | 137 |
| Parent=Aan_R001079; | 884 | Parent=Cpe_R003836; | 461 |
| Parent=Aan_R001728; | 598 | Parent=Cpe_R004636; | 640 |
| Parent=Aan_R001861; | 604 | Parent=Cpe_R005332; | 1643 |
| Parent=Aan_R001895; | 3276 | Parent=Cpe_R008000; | 282 |
| Parent=Aan_R002476; | 896 | Parent=Cpe_R008889; | 709 |
| Parent=Aan_R003858; | 992 | Parent=Cpe_R011470; | 349 |
| Parent=Aan_R004237; | 3287 | Parent=Cpe_R012383; | 1214 |
| Parent=Aan_R004624; | 801 | Parent=Cpe_R012980; | 925 |
| Parent=Aan_R005273; | 508 | Parent=Cpe_R013031; | 753 |
| Parent=Aan_R005275; | 340 | Parent=Cpe_R013115; | 774 |
| Parent=Aan_R006006; | 360 | Parent=Cpe_R013182; | 2619 |
| Parent=Aan_R006018; | 584 | Parent=Cpe_R013834; | 743 |
| Parent=Aan_R006633; | 1151 | Parent=Cpe_R014755; | 608 |
| Parent=Aan_R007365; | 2134 | Parent=Cpe_R014832; | 1370 |
| Parent=Aan_R007414; | 1060 | ||
| Parent=Aan_R008340; | 964 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 12.05 | 72.16 | 98.54 | 105 | 133.5 | 290.3 |
| map2Swift | 0.8245 | 54.06 | 80.49 | 83.44 | 109.8 | 195.4 |
I’m looking for a relation between gene size and depth of coverage. So I fit a linear model to the data.
##
## Call:
## lm(formula = coveragePerGeneAnnaFiltered ~ genesAnnaFiltered$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.192 -32.445 -5.977 28.228 178.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.955749 2.051036 54.585 < 2e-16 ***
## genesAnnaFiltered$Size -0.004442 0.001065 -4.173 3.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.05 on 1454 degrees of freedom
## Multiple R-squared: 0.01183, Adjusted R-squared: 0.01115
## F-statistic: 17.41 on 1 and 1454 DF, p-value: 3.19e-05
##
## Call:
## lm(formula = coveragePerGeneSwiftFiltered ~ genesSwiftFiltered$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.810 -29.445 -2.706 26.272 111.522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.8002830 1.7674363 47.979 <2e-16 ***
## genesSwiftFiltered$Size -0.0008706 0.0009156 -0.951 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.58 on 1448 degrees of freedom
## Multiple R-squared: 0.000624, Adjusted R-squared: -6.614e-05
## F-statistic: 0.9042 on 1 and 1448 DF, p-value: 0.3418
I need to understand how to interpret this
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 0 | 34.12 | 84.28 | 167.6 | 178.3 | 42,890 |
| map2Swift | 0 | 23.04 | 64.59 | 136.6 | 144.9 | 56,610 |
Identification and removal of the outliers
Number of outliers targets and coverage distribution
| Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
|---|---|---|---|---|---|---|---|
| map2Anna | 1,066 | 0 | 31.09 | 75.25 | 102 | 150 | 394.5 |
| map2Swift | 1,096 | 0 | 20.64 | 57.1 | 80.72 | 119.5 | 327.2 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| map2Anna | 0 | 31.09 | 75.25 | 102 | 150 | 394.5 |
| map2Swift | 0 | 20.64 | 57.1 | 80.72 | 119.5 | 327.2 |
Coverage distribution after removing outliers
| Num. outliers | Min. | 1st. Qu. | Median | Mean | 3rd. Qu. | Max. | |
|---|---|---|---|---|---|---|---|
| map2Anna | 810 | 0 | 34.75 | 80.78 | 106.3 | 155.3 | 394.5 |
| map2Swift | 814 | 0 | 23.12 | 60.83 | 84.11 | 124.2 | 327.2 |
Size distribution after removing outliers
I’m looking for a relation between target size and depth of coverage.
##
## Call:
## lm(formula = anna.6$Coverage ~ anna.6$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.95 -65.13 -20.57 47.08 337.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168.43121 2.13600 78.85 <2e-16 ***
## anna.6$Size -0.48399 0.01548 -31.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.31 on 12742 degrees of freedom
## Multiple R-squared: 0.07122, Adjusted R-squared: 0.07115
## F-statistic: 977.1 on 1 and 12742 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = swift.7$Coverage ~ swift.7$Size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.95 -56.29 -19.71 39.86 279.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 126.28067 1.81543 69.56 <2e-16 ***
## swift.7$Size -0.32870 0.01316 -24.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.28 on 12734 degrees of freedom
## Multiple R-squared: 0.0467, Adjusted R-squared: 0.04663
## F-statistic: 623.8 on 1 and 12734 DF, p-value: < 2.2e-16
I need to understand how to interpret this
While there is no specific color key for the coverage values, this will only serve as an overview of the coverage per target/gene per sample.
>0x.| Full (1.469 Genes) | Without Outliers |
|---|---|
| Parent=Cpe_R000823; | Parent=Cpe_R000823; |
| Parent=Cpe_R008410; | Parent=Cpe_R008410; |
| Parent=Cpe_R008909; | Parent=Cpe_R008909; |
This is a summary of the target regions that were not captured by a single base, per sample.
| Dataset | Number of common target regions per sample | Percentage (common target/total target regions) | |
|---|---|---|---|
| Within map2Anna samples | Full dataset | 268 | 1.81818181818182 |
| Within map2Anna samples | Outliers removed | 268 | 2.10295040803515 |
| Within map2Swift samples | Full dataset | 153 | 1.0363045245191 |
| Within map2Swift samples | Outliers removed | 153 | 1.20131909547739 |
| Across datasets | Full Dataset | 0 | 0 |
| Across datasets | Outliers removed | 0 | 0 |
map2Anna target file, and coordinates are differentmap2Swift dataset, due to the lower number of targets in general, even though the percentages are similar (1%)| Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | |
|---|---|---|---|---|---|---|
| Anna | 1 | 91 | 122 | 128.4 | 160 | 285 |
| Swift | 1 | 91 | 122 | 128.3 | 160 | 285 |
First, is important to define what is an off-target region. For the purpose of this analysis, there are two (2) definition of an off-target region.
Calypte_anna.gene.CDS.2750.2.gff and the regions in Chaetura_pelagica.CDS.2.gff, for their corresponding dataset.
I’m assuming that the all the possible DNA captured (on/off-target) is defined by a set of scaffolds in each GFF file. The information that we know is the position of the codons per dataset (represented in the darker box, bottom), the Calypte_anna.gene.CDS.2750.gff file, for the map2Anna dataset and the Chaetura_pelagica.CDS.2.gff for the map2Swift dataset. These positions are relative to the scaffold that represents a region of the genome that was able to be assembled. We also have the relation of the codons to their corresponding genes.
Off-target generation
So, in order to know what was the size of the off-target regions, I first needed to obtain the size of the possible DNA to be captured, meaning on and off target. To do so, I (per file):
WD="/media/merly/Baymax/research/cph-visit/coverage-analysis/"
allgenes=paste0(WD,"files/Calypte_anna.gene.CDS.gff")
anna=read.table(allgenes, colClasses = c("character","character","character",
"numeric","numeric","character",
"character","character","character"))
swiftfull=paste0(WD,"files/Chaetura_pelagica.CDS.gff")
swift=read.table(swiftfull, colClasses = c("character","character","character",
"numeric","numeric","character",
"character","character","character"))
sAnnaAll=split(anna,anna$V1)
sSwiftAll=split(swift,swift$V1)
maxPositionPerScaffoldAnna=sapply(sAnnaAll,function(x){max(x$V5)})
maxPositionPerScaffoldSwift=sapply(sSwiftAll,function(x){max(x$V5)})
maxLen=max(
length(maxPositionPerScaffoldAnna),
length(maxPositionPerScaffoldSwift))
unionScaffolds=union(
names(maxPositionPerScaffoldAnna),
names(maxPositionPerScaffoldSwift))
df=data.frame(
anna=rep(0,length(unionScaffolds)),
swift=rep(0,length(unionScaffolds)))
rownames(df)=unionScaffolds
df[names(maxPositionPerScaffoldAnna),]$anna=maxPositionPerScaffoldAnna[names(maxPositionPerScaffoldAnna)]
df[names(maxPositionPerScaffoldSwift),]$swift=maxPositionPerScaffoldSwift[names(maxPositionPerScaffoldSwift)]
dfAnna=df$anna[df$anna!=0]
names(dfAnna)=rownames(df)[df$anna!=0]
finalBed=cbind(names(dfAnna),rep(1,length(dfAnna)),dfAnna)
write.table(finalBed,paste0(WD,"files/scaffolds.anna.bed"), row.names = F,col.names = F,quote = F, sep="\t")
dfSwift=df$swift[df$swift!=0]
names(dfSwift)=rownames(df)[df$swift!=0]
finalBed=cbind(names(dfSwift),rep(1,length(dfSwift)),dfSwift)
write.table(finalBed,paste0(WD,"files/scaffolds.swift.bed"), row.names = F,col.names = F,quote = F, sep="\t")
bedtools subtract (searches for features in B that overlap A. If an overlapping feature is found in B, the overlapping portion is removed from A and the remaining portion of A is reported) to removed all possible targets from the GFF files out of their corresponding scaffolds file, obtaining two (2) BED files, one per dataset.bedtools subtract -a scaffolds.anna.bed -b $gffAnna > "$HOME/files/offtarget.anna.bed" &
bedtools subtract -a scaffolds.anna.bed -b $gffSwift > "$HOME/files/offtarget.swift.bed" &
General overview of scaffolds and resulting off-target regions
| Number of scaffolds | Total Genome Size (scaffolds) | Number of resulting targets | Total Genome Size (Off-target) | |
|---|---|---|---|---|
| Anna | 1,229 | 1,000,821,032 | 15,929 | 998,519,317 |
| Swift | 1,189 | 1,040,991,443 | 15,906 | 1,038,695,864 |
General overview of the size distribution of off-target regions
| Unit | Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | |
|---|---|---|---|---|---|---|---|
| Anna | Scaffold | 106 | 1,664 | 17,380 | 814,300 | 627,000 | 23,300,000 |
| Swift | Scaffold | 103 | 8,954 | 93,180 | 875,500 | 767,800 | 19,070,000 |
| Anna | Target | 1 | 523 | 1,089 | 62,690 | 3,092 | 5,756,000 |
| Swift | Target | 1 | 553 | 1,151 | 65,300 | 3,544 | 5,739,000 |
The idea is to identify the off-target regions, as well as the size that covers and how is the coverage distribution within this regions compared to the on-target coverage.
To obtain coverage information from the off-target regions:
for bamfile in $(find $originalsAnna -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.anna.bed" | gzip > $HOME/anna/bedtools/cov2/${tag}.cov.gz &
done
for bamfile in $(find $originalsSwift -name "*.bam" ); do
tag=$(echo $(basename $bamfile) | tr "_" " " | awk '{print $1}')
nohup bedtools coverage -hist -abam $bamfile -b "$HOME/files/offtarget.swift.bed" | gzip > $HOME/swift/bedtools/cov2/${tag}.cov.gz &
done
# hist split
for tag in $(cat $HOME/anna/files/samples.txt ); do
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/anna/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/anna/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/anna/bedtools/hist2/$tag.hist.gz
done
for tag in $(cat $HOME/swift/files/samples.txt ); do
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep -v ^all | gzip > $HOME/swift/bedtools/nohist2/$tag.nohist.gz
zcat $HOME/swift/bedtools/cov2/$tag.cov.gz | grep ^all | gzip > $HOME/swift/bedtools/hist2/$tag.hist.gz
done
I followed what has been done for the on-target regions to get this information. Getting how much of the off-target regions was captured, and at which depth.
map2Anna |
map2Swift |
|---|---|
| X-axis: 0-20 | X-axis: 0-20 |
map2Anna dataset, for both off-target sets breadth vs. depth looks the same. Same situation with the mat2Swift dataset.map2Anna, we see that the general depth is low, 25% of the off-target regions are covered at 1x for most of the samples.map2Swift, we see that there are 3 situations:
AnnaBGI and SwiftBGI in the map2Anna datasets, is because these samples were not in that mapping.| Min. | 1st. Qu. | Median | Mean | 3rd. Qu | Max. | Var. | Std. Dev | |
|---|---|---|---|---|---|---|---|---|
| map2Anna | 0 | 1.518 | 7.4 | 48.29 | 21.89 | 34,600 | 243,841.7 | 493.8033 |
| map2Swift | 0 | 0.6185 | 2.794 | 34.51 | 9.069 | 37,860 | 337,580.14666756 | 581.016477105048 |
I am trying to analyze if there is any correlation between the depth of coverage and the phylogenetic distance from the reference species, used for the probe generation and mapping.
Information of the phylogenetic reconstruction I got from the Hummingbirds Paper, this says that it was made with mitochondrial genome and nuclear gene trees using either a concatenation approach with RAxML [3] or the multi-species coalescent approach implemented in ASTRAL [4] and ASTRAL-II [5].
The phylogenies were built using subsets of the nuclear genes present the same topology between the main groups of hummingbirds with high confidence. The three subsets correspond to:
This is the Supplementary Table S5 (from the Hummingbirds Paper), describing the subsets selected:
| number of genes | minimum number of sites per species (concatenated) | maximum number of sites per species (concatenated) | includes outgroup | ASTRAL score |
|---|---|---|---|---|
| 741 | 949,470 | 1,542,750 | yes | 87% |
| 1987 | 1,667,071 | 2,783,832 | yes | 78% |
| 2949 | 2,473,664 | 3,792,500 |
Maximum likelihood tree
Cpe, outgroup reference.Aan, ingroup reference.For the analysis shown above as well as for phylogeny reconstruction using the dataset (1) with 2,949 genes.
To obtain the distance matrix I used the tree obtained with RAxML, from the concatenation of 2,750 genes. The tree gives me the branch lengths in number of substitutions per site. So, the distance calculated here, is the pairwise distance of the tips. This was done with the R package phangorn [7], and its cophenetic function.
sampleColors = colorRampPalette(brewer.pal(9, "Set1"))(48)
treefile="/home/merly/git/cph-visit/coverage-analysis/capture-phylotenetic-decay/trees/RAxML_bipartitions.2011.concat"
tree=read.nexus(treefile)
distMatrix=cophenetic(tree)
dCpe=distMatrix["Cpe",]
dAan=distMatrix["Aan",]
distMatrix[upper.tri(distMatrix)]=NA
Following the figure F3A in Braggs’s paper [8], where it shows sequencing coverage as a function of genetic divergence. I have calculated the depth of coverage for both datasets and phylogenetic distance. In the plot below you can see the median target coverage per sample.
## Warning in matrixAnna/targets$Size: longer object length is not a multiple
## of shorter object length
After genes with missing data were removed leaving 2,750 genes, and after matching those genes to the ones available in Swift, 1,469.
For the phylogenetic distance vs. coverage analysis, I will used a smaller dataset (2), with 1,987 genes, which I will intersect with those from which I do have the coverage information.
To do that, I generated a new dataset from the intersection of the genes found in Calypte_anna.gene.CDS.2750.3.gff (1,469), which are those ortholog genes matching across datasets, and the available gene trees (1,988).
cat "$WD/files/Calypte_anna.gene.CDS.2750.3.gff" | cut -f9 | uniq | tr "_" " " | awk '{print $2}' | tr ";" " " > g.trees.to.filter.txt
for gtreefile in $(cat $WD/files/g.trees.to.filter.txt); do
cp "$WD/trees/gtrees/$gtreefile" "$WD/trees/gtrees.2/"
done
R001806, R005562, R009279, R014527, R014590.ls "$WD/trees/seqs" | tr "." " " | awk '{print $1}' > "$WD/files/g.trees.to.filter.seqs.2.txt"
mkdir "$WD/trees/seqs.2"
for gtreefile in $(cat $WD/files/g.trees.to.filter.seqs.2.txt); do
filename="$WD/trees/seqs/${gtreefile}.x.aa.aln.backTranslated.gz"
cp $filename "$WD/trees/seqs.2/"
done
Anna and to Swift.geneList=unlist(read.table(paste0(WD,"files/g.trees.to.filter.seqs.2.txt"), stringsAsFactors=F))
distancesAnna=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
distancesSwift=matrix(rep(0,length(geneList)*length(sampleNamesSwift)), length(geneList), length(sampleNamesSwift))
rownames(distancesAnna)=geneList; colnames(distancesAnna)=sampleNamesSwift;
rownames(distancesSwift)=geneList; colnames(distancesSwift)=sampleNamesSwift
gtreefiles <- list.files(
path = paste0(WD,"trees/seqs.2/"),
pattern="*.x.aa.aln.backTranslated")
gtreefiles=paste0(WD,"trees/seqs.2/",gtreefiles)
labelsCorrect=c("H05","H01","H08","H03","H04","H06","H02","H07")
labelsWronglyFormated=c("H5","H1","H8","H3","H4","H6","H2","H7")
for (i in 1:length(gtreefiles)) {
print(i)
msa=as.phyDat(read.FASTA(gzfile(gtreefiles[i])))
hammDist=as.matrix(dist.hamming(msa))
indexAnna=which(startsWith(rownames(hammDist),"Aan"))
indexSwift=which(startsWith(rownames(hammDist),"Cpe"))
namesIds=which(rownames(hammDist) %in% labelsCorrect)
rownames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
colnames(hammDist)[c(indexAnna, indexSwift, namesIds)]=c("AnnaBGI","SwiftBGI",labelsWronglyFormated)
gene=strsplit(basename(gtreefiles[i]),"\\.")[[1]][1]
distancesAnna[gene,colnames(distancesAnna)]=hammDist[indexAnna,colnames(distancesAnna)]
distancesSwift[gene,colnames(distancesSwift)]=hammDist[indexSwift,colnames(distancesSwift)]
}
write.table(distancesAnna,file=paste0(WD,"files/distance.matrix.per.gene.anna.txt"), col.names=T,row.names=T)
write.table(distancesSwift,file=paste0(WD,"files/distance.matrix.per.gene.swift.txt"), col.names=T,row.names=T)
distancesAnna=read.table(file=paste0(WD,"files/distance.matrix.per.gene.anna.txt"), header=T)
distancesSwift=read.table(file=paste0(WD,"files/distance.matrix.per.gene.swift.txt"), header=T)
birds48filename=paste0(WD,"files/48birds_ortholog.list.chi.anna.cpe.hum.finch")
birds=read.table(birds48filename, header=T,
colClasses=c("character","numeric","character","character",
"character","character","character"))
birds.2=birds[birds$anna!="-",]
birds.3=birds.2[birds.2$swift!="-",]
birds.4=unique(birds.3)
birds.5=birds.4[,c("anna","swift")]
birds.5$swift=paste0("Parent=",birds.5$swift,";")
birds.5$anna=paste0("Parent=Aan_", birds.5$anna,";")
annaGenes=unique(gffAnna$V9)
birds.6=birds.5[birds.5$anna %in% annaGenes, ]
birds.6$swift.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$swift,"_"), function(x){x[[2]]})), ";"))
birds.6$anna.plain=unlist(strsplit(unlist(lapply(strsplit(birds.6$anna,"_"), function(x){x[[2]]})), ";"))
smallbird=birds.6[birds.6$anna.plain %in% intersect(birds.6$anna.plain,geneList),]
distancesAnna=distancesAnna[smallbird$anna.plain,]
distancesSwift=distancesSwift[smallbird$anna.plain,]
rownames(distancesAnna)=smallbird$anna
rownames(distancesSwift)=smallbird$swift
samplesFilenameAnna=paste0(WD,"anna/files/samples.txt")
samplesFilenameSwift=paste0(WD,"swift/files/samples.txt")
sampleNamesAnna=unlist(read.table(samplesFilenameAnna, stringsAsFactors = F))
sampleNamesSwift=unlist(read.table(samplesFilenameSwift, stringsAsFactors = F))
coverageMatrixFileAnna=paste0(WD,"anna/files/coverage.matrix.per.gene.txt")
coverageMatrixFileSwift=paste0(WD,"swift/files/coverage.matrix.per.gene.txt")
coverageMatrixAnna=read.table(coverageMatrixFileAnna, header=T)/genesAnna$Size
coverageMatrixSwift=read.table(coverageMatrixFileSwift, header=T)/genesSwift$Size
rownames(coverageMatrixAnna)=genesAnna$Gene
rownames(coverageMatrixSwift)=genesSwift$Gene
#-------------------------------------------------------------------------------
cmswift=coverageMatrixSwift[smallbird$swift,]
cmanna=coverageMatrixAnna[smallbird$anna,]
cmanna=cbind(cmanna,AnnaBGI=rep(0,nrow(cmanna)), SwiftBGI=rep(0,nrow(cmanna)))
##
## Call:
## lm(formula = finalCoverageSwift ~ finalDistanceSwift)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.60 -33.14 -11.46 22.18 161.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.2915 0.9318 91.54 <2e-16 ***
## finalDistanceSwift -346.9787 12.2923 -28.23 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 46.16 on 25678 degrees of freedom
## Multiple R-squared: 0.0301, Adjusted R-squared: 0.03006
## F-statistic: 796.8 on 1 and 25678 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = finalCoverageAnna ~ finalDistanceAnna)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.17 -39.53 -13.07 26.76 181.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.0040 0.7353 91.12 <2e-16 ***
## finalDistanceAnna 431.0792 36.8383 11.70 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 54.25 on 24992 degrees of freedom
## Multiple R-squared: 0.005449, Adjusted R-squared: 0.00541
## F-statistic: 136.9 on 1 and 24992 DF, p-value: < 2.2e-16
Now, the plots below show the relation between phylogenetic distance and coverage, of two (2) samples from the Cocquettes, Brilliants, Mangoes and Hermits groups, following the ML tree from the Hummingbirds paper (F1,A). These samples correspond to one with high coverage and one with low coverage.
References